RED WINE EDA BY AMONAH ALI
The red wine dataset is analysed to find the variable that most affects wine quality.
## [1] 1599 13
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Quality ranges from 3 to 8, with a mean of 5.636 and a median of 6.
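The output above is what calls such as `dim()`, `summary()`, `head()` and `names()` print for a data frame. As a minimal sketch, the six rows shown by `head()` can be rebuilt by hand and inspected the same way. The object name `wine` and the file name in the comment are assumptions, not taken from the report.

```r
# First six rows of the dataset, copied from the head() output above.
# With the full file one would read it instead, for example:
# wine <- read.csv("wineQualityReds.csv")  # file name is an assumption
wine <- data.frame(
  X = 1:6,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8),
  chlorides = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075),
  free.sulfur.dioxide = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40),
  density = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978),
  pH = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L)
)

dim(wine)              # number of rows and columns
names(wine)            # column names
head(wine)             # first rows
summary(wine$quality)  # five-number summary plus mean for one column
```

On these six rows the summaries will of course differ from the full-data values printed above.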
Next, each of the other variables is examined to see whether it affects quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
volatile.acidity ranges from 0.12 to 1.58, with a mean of 0.528 and a median of 0.52.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
pH ranges from 2.74 to 4.01, with a mean of 3.311 and a median of 3.31.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
residual.sugar ranges from 0.9 to 15.5, with a mean of 2.539 and a median of 2.2.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
density ranges from 0.9901 to 1.0037, with a mean of 0.9967 and a median of 0.9968.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
fixed.acidity ranges from 4.6 to 15.9, with a mean of 8.32 and a median of 7.9.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
free.sulfur.dioxide ranges from 1 to 72, with a mean of 15.87 and a median of 14.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
sulphates ranges from 0.33 to 2, with a mean of 0.6581 and a median of 0.62.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
alcohol ranges from 8.4 to 14.9, with a mean of 10.42 and a median of 10.2.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
total.sulfur.dioxide ranges from 6 to 289, with a mean of 46.47 and a median of 38.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
chlorides ranges from 0.012 to 0.611, with a mean of 0.08747 and a median of 0.079.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
citric.acid ranges from 0 to 1, with a mean of 0.271 and a median of 0.26.
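One rough way to read skew from the summaries above is to compare each mean with its median: a mean above the median hints at a longer right tail. The sketch below does that with the values printed above. It is only a heuristic and does not replace looking at the histograms. The names `stats`, `mean_above_median` and `relative_gap` are made up for this illustration.

```r
# Medians and means copied from the summary() output above.
stats <- data.frame(
  variable = c("volatile.acidity", "pH", "residual.sugar", "density",
               "fixed.acidity", "free.sulfur.dioxide", "sulphates",
               "alcohol", "total.sulfur.dioxide", "chlorides",
               "citric.acid", "quality"),
  median = c(0.52, 3.31, 2.2, 0.9968, 7.9, 14, 0.62, 10.2, 38,
             0.079, 0.26, 6),
  mean = c(0.5278, 3.311, 2.539, 0.9967, 8.32, 15.87, 0.6581, 10.42,
           46.47, 0.08747, 0.271, 5.636)
)

# TRUE where the mean lies above the median (a hint of right skew).
stats$mean_above_median <- stats$mean > stats$median

# Gap between mean and median relative to the median, to rank the variables.
stats$relative_gap <- (stats$mean - stats$median) / stats$median

# Variables with the largest relative gap first.
stats[order(-stats$relative_gap), ]
```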
The dataset contains 1599 wines and 12 variables. The main variable is quality, which ranges from 3 to 8, although most samples are rated 5 or 6. Most of the distributions are right-skewed.

### What is/are the main feature(s) of interest in your dataset?
The main feature is quality, because wine is categorized for consumers by its quality level (high, average or low), and the other variables influence that quality.

### What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
Several features should help the investigation, such as alcohol, citric acid, volatile acidity and sulphates.

### Did you create any new variables from existing variables in the dataset?
A wine level variable was created to record the level of wine quality. It has three levels: high, average and low.
## 'data.frame': 1599 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
## $ quality.lvl : chr "average" "average" "average" "average" ...
The plot shows the quality levels; most wines fall in the average level.
The data is tidy. The only changes were adding a new column, the wine level derived from quality, and dropping the X column, which is not useful for the analysis.
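The changes described above could be made along the following lines. This is a sketch, not the report's actual code: the `str()` output confirms that quality is an ordered factor with six levels and that `quality.lvl` is a character column in which scores of 5 and 6 are "average", but the cut points for "low" (4 and below) and "high" (7 and above) are assumptions.

```r
# Quality scores of the first six wines, copied from the head() output.
wine <- data.frame(X = 1:6, quality = c(5L, 5L, 5L, 6L, 5L, 5L))

# Bucket the numeric score into three levels.
# Assumed cut points: (0, 4] low, (4, 6] average, (6, 10] high.
wine$quality.lvl <- as.character(
  cut(wine$quality,
      breaks = c(0, 4, 6, 10),
      labels = c("low", "average", "high"))
)

# Store quality as an ordered factor with the six observed scores 3..8.
wine$quality <- factor(wine$quality, levels = 3:8, ordered = TRUE)

# Drop the row-index column.
wine$X <- NULL

str(wine)
```

The bucketing is done before the conversion to a factor, because `cut()` needs the numeric scores.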
The first factor to examine is alcohol, followed by volatile acidity, sulphates and citric acid. The correlations between factors are examined as well.
The plots above show that wines with more alcohol tend to be rated higher, but the relationships of other variables with quality must also be examined, because alcohol is not the only factor that affects quality.
Quality increases as sulphates increase, but at very high sulphate levels the effect on quality turns negative.
The plot shows that volatile.acidity decreases as quality increases, so the relationship is inverse.
Quality increases as citric acid increases.
There is a weak correlation between alcohol and sulphates.
There is no correlation between alcohol and citric acid.
There is a strong positive correlation between total sulfur dioxide and free sulfur dioxide.
There is a strong positive correlation between fixed acidity and citric acid.
Sulphates and chlorides form a cluster.
1) Chlorides and sulphates have an interesting relationship. 2) A strong positive correlation was found between citric acid and fixed acidity.
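Pairwise correlations like the ones described above can be computed with `cor()`. The sketch below uses only the six rows printed by `head()` earlier, so the coefficients it returns are not the report's results; it only illustrates the call. The variable names `r_sulfur` and `r_acid` are made up for this example.

```r
# Four columns of the first six wines, copied from the head() output.
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  citric.acid = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  free.sulfur.dioxide = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40)
)

# Pearson correlation between the two sulfur dioxide measures.
r_sulfur <- cor(wine$free.sulfur.dioxide, wine$total.sulfur.dioxide,
                method = "pearson")

# Pearson correlation between fixed acidity and citric acid.
r_acid <- cor(wine$fixed.acidity, wine$citric.acid, method = "pearson")

# Full correlation matrix of the four columns, rounded for reading.
round(cor(wine), 2)
```

On the full dataset, `cor(wine)` over all numeric columns gives the whole matrix at once, which makes the strongest pairs easy to spot.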
The strongest relationship with quality is that of alcohol.
The next plots look at the relationships between factors, with quality shown as colour.
Higher sulphates tend to go with higher alcohol.
Density does not appear to affect alcohol.
volatile.acidity and alcohol have an inverse relationship: as alcohol increases, volatile.acidity decreases.
Alcohol increases as citric acid increases.
pH and alcohol show the same pattern as volatile.acidity and alcohol: alcohol increases as pH decreases.
Lower total.sulfur.dioxide goes with higher alcohol.
No correlation is apparent between fixed.acidity and volatile.acidity.
Low volatile.acidity combined with high citric.acid goes with high quality.
With points coloured by quality, fixed.acidity and citric.acid show no clear pattern with quality.
A particular range of sulphates and chlorides goes with high-quality wine.
The most interesting interaction is between chlorides and sulphates.
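A scatter plot of two factors with quality as colour, as used in this section, could be drawn as in the sketch below. It uses base R graphics and the six rows from the earlier `head()` output; the report's own plotting code and library are not shown, so the palette, the choice of sulphates against alcohol, and the names `quality_f`, `palette_q` and `point_cols` are all assumptions.

```r
# Three columns of the first six wines, copied from the head() output.
wine <- data.frame(
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L)
)

# One colour per quality score from 3 to 8.
quality_f <- factor(wine$quality, levels = 3:8)
palette_q <- hcl.colors(nlevels(quality_f), "Viridis")
point_cols <- palette_q[as.integer(quality_f)]

# Sulphates against alcohol, each point coloured by its quality score.
plot(wine$sulphates, wine$alcohol,
     col = point_cols, pch = 19,
     xlab = "sulphates", ylab = "alcohol",
     main = "Alcohol vs sulphates, coloured by quality")
legend("topleft", legend = levels(quality_f),
       col = palette_q, pch = 19, title = "quality")
```

Swapping the two plotted columns gives the other pairings discussed above.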
The first plot shows the quality of the wine, which is the main feature in this dataset. Scores of 5, 6 and 7 are common, while scores of 3 and 4 are rare.
The third plot shows that sulphates and alcohol are both high when wine quality is high; together they are associated with high-quality wine and affect quality positively.
The red wine dataset contains 1599 wines and 12 variables. The analysis started with the main feature, quality, because a wine is nothing without its quality, and the other factors were then analysed individually. The next most important factor for quality is alcohol; citric acid, volatile acidity and sulphates also affect quality.

Strong positive correlations were found between total and free sulfur dioxide and between fixed acidity and citric acid. There is also a strange relationship between sulphates and chlorides, which form a cluster. The multivariate plots looked at the relationships between factors with quality shown as colour. They led to the conclusion that alcohol is the main factor affecting quality, with citric acid, volatile acidity and sulphates as supporting factors.

The limitation of this dataset is that most wines are rated 5 or 6. In addition, no quality level is given, which is why one was created. In future work I hope to analyse a dataset similar to the red wine dataset that already has a classification or level for the quality of the wine.